Red Wine Exploration

by David Vartanian

Abstract

I explore a dataset of almost 1,600 red wines in order to understand what lies behind the quality score assigned to each one.

Introduction

This dataset is provided by Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis, from different universities in Portugal. It includes chemical measurements such as acidity, residual sugar, chlorides, and alcohol, among others. I explore the data to find patterns and trends and to understand what the given features mean. More information here.

Univariate Plots Section

Let’s start by showing some summary statistics and first histograms to understand the individual variables.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Histograms: quality, fixed.acidity, total.sulfur.dioxide, alcohol

These histograms show how the values are distributed in the different variables.

Outliers

There are a few outliers only on the right side.

There are several outliers only on the right side.

There are many outliers only on the right side.

There are just a few outliers only on the right side.

There are just a few outliers on the left side, and many on the right side.

There are many outliers on the right side.

All values are pretty well distributed in the pH variable. There are several outliers on both sides.

There are many outliers only on the right side.

This variable is also well distributed. There are several outliers on both sides.
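The outlier remarks above read like boxplot observations. As a minimal sketch, assuming the usual boxplot convention (Tukey's rule, where a point is an outlier if it lies more than 1.5 × IQR beyond the quartiles), the left-side and right-side counts could be obtained as follows. The function name `count_outliers` and the sample values are hypothetical, and Python is used only for illustration; the summary output above suggests the original analysis was done in R.

```python
import statistics

def count_outliers(values, k=1.5):
    """Count points below Q1 - k*IQR and above Q3 + k*IQR (Tukey's rule)."""
    # method="inclusive" interpolates quartiles the same way R's default does.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    left = sum(1 for v in values if v < low_fence)
    right = sum(1 for v in values if v > high_fence)
    return left, right

# Hypothetical right-skewed values, NOT taken from the wine dataset.
sample = [1.9, 2.0, 2.1, 2.2, 2.2, 2.3, 2.5, 2.6, 8.0, 15.5]
left, right = count_outliers(sample)
print(f"left outliers: {left}, right outliers: {right}")
```

A variable with outliers "only on the right side" would give a zero left count and a positive right count under this rule.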

Univariate Analysis

Dataset Structure

There are 9 continuous variables, 2 discrete variables and one ordered categorical variable: quality.

Main dataset interest

My general question is, how do chemical properties define the quality of the red wine?

There are interesting features in this dataset, each of them describing an important property of the red wine. Density, pH, sulphur dioxide, and sulphates are, in my opinion, the most important ones, in order to measure the quality. Let’s see what we can find by looking at those variables.

pH

This variable indicates the acidity level of the wine. The scale goes from 0 (very acidic) to 14 (very basic), but most red wines are between 3 and 4.

It’s quite surprising that pH levels are lower for both low-quality and high-quality wines.

Density

This is the density of the wine itself, which is close to that of water and depends on alcohol percentage and sugar content.

Density levels are also lower on low-quality and high-quality wines.

Free Sulphur Dioxide

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of the wine.

Again, the levels for this variable are lower for both low-quality and high-quality wines.

Sulphates

An additive that can contribute to sulphur dioxide gas (SO2) levels and acts as an antimicrobial and antioxidant.

Sulphate levels are lower for low-quality and high-quality wines as well.

Variable Transformations

It was not necessary to clean missing values on this dataset. However, I think it is a good idea to apply some transformations to skewed variables.

Transformed Volatile Acidity using log base 10.

Transformed Fixed Acidity using log base 10.

Transformed Total Sulphur Dioxide using log base 10.

Transformed Chlorides using log base 10.

Transformed Residual Sugar using log base 10.

Transformed Sulphates using log base 10.

Transformed Free Sulphur Dioxide using log base 10.
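To illustrate why a log base 10 transformation helps with skewed variables, here is a minimal sketch that compares skewness before and after the transform. The `skewness` helper (a simple moment-based definition) and the sample values are assumptions made for illustration; they are not taken from the dataset or from the original analysis.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed measurements, NOT taken from the wine dataset.
raw = [6.0, 12.0, 20.0, 25.0, 38.0, 45.0, 62.0, 90.0, 150.0, 289.0]

# log10 is only defined for strictly positive values.
logged = [math.log10(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The transform compresses the long right tail, so the logged values sit much closer to a symmetric distribution than the raw ones.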

Durability

Using the new variable durability, it’s possible to appreciate the combined effect of sulphates and free sulphur dioxide.

This variable has only two values, S (short) and L (long), using the medians of sulphates and free sulphur dioxide as the threshold.
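A derived variable like this can be sketched in a few lines. The exact rule is not spelled out in the text, so the sketch below assumes one: a wine is labelled L only when both its sulphates and its free sulphur dioxide are above their respective medians, and S otherwise. The function name `durability_labels` and the sample values are hypothetical.

```python
import statistics

def durability_labels(sulphates, free_so2):
    """Label each wine 'L' (long) or 'S' (short).

    ASSUMED rule, not confirmed by the original analysis: 'L' when both
    sulphates and free sulphur dioxide exceed their medians, else 'S'.
    """
    med_sulphates = statistics.median(sulphates)
    med_free_so2 = statistics.median(free_so2)
    return [
        "L" if s > med_sulphates and f > med_free_so2 else "S"
        for s, f in zip(sulphates, free_so2)
    ]

# Hypothetical values for five wines, NOT taken from the dataset.
sulphates = [0.45, 0.55, 0.62, 0.73, 0.90]
free_so2 = [5.0, 21.0, 14.0, 30.0, 7.0]
print(durability_labels(sulphates, free_so2))
```

Other readings of the rule are possible, for example requiring only one of the two variables to exceed its median, and they would give a different split between S and L.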

Bivariate Plots Section

Let’s try to find trends and interesting patterns by comparing two variables.

Fact: Higher quality wines seem to have higher levels of alcohol

Fact: Higher quality wines seem to have lower levels of acidity

Fact: Higher quality wines seem to have lower density

Citric acid adds a fresh flavor to the wine.

Volatile acidity is the level of acetic acid. Too high a level gives the wine an unpleasant vinegar taste.

Bivariate Analysis

Relationships

I’ve found a slightly negative correlation, meaning that density tends to be lower in high-quality wines. However, this correlation is too weak to determine quality on its own.
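For reference, the strength and direction of a relationship like this is usually summarised with Pearson's correlation coefficient. The sketch below computes it by hand; the function name `pearson_r` and the sample values are hypothetical and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical quality scores and densities, NOT taken from the dataset.
quality = [3, 4, 5, 5, 6, 6, 7, 8]
density = [0.9975, 0.9966, 0.9971, 0.9968, 0.9966, 0.9970, 0.9961, 0.9952]

r = pearson_r(quality, density)
print(f"r = {r:.3f}")
```

A negative coefficient means the second variable tends to fall as the first rises. Values near zero indicate a weak linear relationship regardless of sign.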

I’ve found that levels are mostly low for both variables. I would say that they don’t have much influence on quality, as all types of wine show similar levels of these two variables.

I’ve found the same here, as they keep levels constantly low.

Levels are always low. However, these two variables seem to be correlated.

Interesting relationships

So far I find only density to be an interesting variable to look at. The rest (chlorides, sulphates, residual sugar, and sulphur dioxide) don’t seem to have a great influence on wine quality.

Multivariate Plots Section

A fairly strong correlation can be observed between these two variables with regard to wine quality, meaning that it’s normal to find lower levels of pH and density in high-quality wines. The line colours show how durable the wine can be with respect to alcohol, using the durability variable introduced above. It makes sense to me that wines last longer if they contain more alcohol in addition to sulphates and free sulphur dioxide.

I’ve found another interesting correlation here, which becomes quite obvious if we pay special attention to the meaning of the variables. Density, as noted above, depends on alcohol content: alcohol is less dense than water, so the more alcohol, the lower the density. The coloured lines show that the durability of the wine is lower when the density is higher. Does that make sense?

Multivariate Analysis

##     quality      mean_quality   mean_alcohol    mean_density   
##  Min.   :3.00   Min.   :3.00   Min.   : 9.90   Min.   :0.9952  
##  1st Qu.:4.25   1st Qu.:4.25   1st Qu.:10.03   1st Qu.:0.9962  
##  Median :5.50   Median :5.50   Median :10.45   Median :0.9966  
##  Mean   :5.50   Mean   :5.50   Mean   :10.72   Mean   :0.9965  
##  3rd Qu.:6.75   3rd Qu.:6.75   3rd Qu.:11.26   3rd Qu.:0.9970  
##  Max.   :8.00   Max.   :8.00   Max.   :12.09   Max.   :0.9975  
##     mean_ph      mean_citric_acid       n         
##  Min.   :3.267   Min.   :0.1710   Min.   : 10.00  
##  1st Qu.:3.294   1st Qu.:0.1915   1st Qu.: 26.75  
##  Median :3.312   Median :0.2588   Median :126.00  
##  Mean   :3.327   Mean   :0.2715   Mean   :266.50  
##  3rd Qu.:3.366   3rd Qu.:0.3498   3rd Qu.:528.25  
##  Max.   :3.398   Max.   :0.3911   Max.   :681.00
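The table above summarises a per-quality aggregate: one row per quality level, holding the mean of several variables and the number of wines `n`. As a rough sketch of how such an aggregate can be built, the code below groups rows by quality and averages selected fields. The helper name `group_means` and the sample rows are hypothetical, not taken from the dataset.

```python
from collections import defaultdict
from statistics import fmean

def group_means(rows, key, fields):
    """Group rows by `key`; return per-group means of `fields` plus a count."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    summary = {}
    for level in sorted(groups):
        members = groups[level]
        entry = {f"mean_{f}": fmean([r[f] for r in members]) for f in fields}
        entry["n"] = len(members)
        summary[level] = entry
    return summary

# Hypothetical rows, NOT taken from the wine dataset.
wines = [
    {"quality": 5, "alcohol": 9.4, "pH": 3.51},
    {"quality": 5, "alcohol": 9.8, "pH": 3.20},
    {"quality": 6, "alcohol": 10.5, "pH": 3.30},
    {"quality": 7, "alcohol": 11.0, "pH": 3.40},
    {"quality": 7, "alcohol": 12.0, "pH": 3.20},
]

summary = group_means(wines, "quality", ["alcohol", "pH"])
for level, entry in summary.items():
    print(level, entry)
```

Keeping `n` next to the means matters here, because the extreme quality levels contain far fewer wines than the middle ones and their means are correspondingly less reliable.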

Density & pH

Durability


Final Plots and Summary

Durability & Alcohol

This plot shows something remarkable: high-quality wines seem to last longer. The orange line in the top-right corner makes the biggest difference: wines last much longer when the alcohol level is higher.

Citric Acid vs. pH

This is a pretty straightforward correlation. When the pH level gets lower (which means the wine is more acidic), citric acid gets higher. It makes sense, doesn’t it?

Density by Quality Level

I wanted to emphasize this plot again because density levels look similar for both low-quality and high-quality wines. Or, from another perspective, density is higher only for mid-quality wines.


Reflection

I feel that I now have a few extra tips for selecting new wines to taste: look for higher levels of alcohol and acidity, lower density, and low levels of residual sugar, chlorides, and sulphates. The high alcohol levels and low density were definitely surprising to me. However, I think the dataset needs more categorical variables and much more data to support a better analysis.

For instance, adding columns for usual customers, sommeliers’ preferences, country of origin, types of grape, altitude of the grape crops, and the type of cask used to store the wine before selling would be of great value. They would help measure wine quality not only from the product itself but also from its background environment and production process.